Are graph databases ready for bioinformatics?
Authors
Abstract
Graphs are ubiquitous in bioinformatics and frequently consist of too many nodes and edges to represent in RAM. These graphs are thus stored in databases to allow for efficient queries using declarative query languages such as SQL. Traditional relational databases (e.g. MySQL and PostgreSQL) have long been used for this purpose and are based on decades of research into query optimization. Recently, NoSQL databases have attracted a lot of attention due to their advantages in scalability. The term NoSQL is used to refer to schemaless databases such as key/value stores (e.g. Apache Cassandra), document stores (e.g. MongoDB) and graph databases (e.g. AllegroGraph, Neo4j, OpenLink Virtuoso), which do not fit within the traditional relational paradigm. Most NoSQL databases do not have a declarative query language. The widely used Neo4j graph database is an exception. Its query language, Cypher, is designed for expressing graph queries but is still evolving. Graph databases have so far seen only limited use within bioinformatics [Schriml et al., 2013]. To illustrate the pros and cons of using a graph database (exemplified by Neo4j v1.8.1) instead of a relational database (PostgreSQL v9.1), we imported the human interaction network from STRING v9.05 [Franceschini et al., 2013] into both. This is an approximately scale-free network with 20,140 proteins and 2.2 million interactions. Like all graph databases, Neo4j stores edges as direct pointers between nodes, which can thus be traversed in constant time. Because Neo4j uses the property graph model, nodes and edges can have properties associated with them; we use this for storing the protein names and the confidence scores associated with the interactions. In PostgreSQL, we stored the graph as an indexed table of node pairs, which can be traversed with either logarithmic or constant lookup complexity depending on the type of index used.
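The two storage layouts described above can be compared on a toy example. The following is a minimal sketch, not the authors' benchmark code: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and a plain dictionary as a stand-in for pointer-based graph storage. The table name interactions, its column names, and the proteins A to D with their scores are all hypothetical.

```python
import sqlite3

# Toy interactions (hypothetical names and scores, not STRING data).
EDGES = [("A", "B", 0.9), ("A", "C", 0.7), ("B", "C", 0.4), ("C", "D", 0.8)]

# Relational layout: one row per direction of each undirected interaction,
# with an index on the source column so neighbor lookups avoid a full scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interactions (protein1 TEXT, protein2 TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [(a, b, s) for a, b, s in EDGES] + [(b, a, s) for a, b, s in EDGES],
)
conn.execute("CREATE INDEX idx_protein1 ON interactions (protein1)")


def neighbors_sql(protein):
    """Find neighbors through an index lookup on the node-pair table."""
    rows = conn.execute(
        "SELECT protein2, score FROM interactions "
        "WHERE protein1 = ? ORDER BY protein2",
        (protein,),
    )
    return rows.fetchall()


# Property-graph-style layout: each node refers directly to its edges,
# and each edge carries its own properties (here, the confidence score).
adjacency = {}
for a, b, s in EDGES:
    adjacency.setdefault(a, {})[b] = {"score": s}
    adjacency.setdefault(b, {})[a] = {"score": s}


def neighbors_graph(protein):
    """Find neighbors by following the node's own edge references."""
    edges = adjacency[protein]
    return sorted((other, props["score"]) for other, props in edges.items())
```

SQLite indexes are B-trees, so the lookup in this sketch is logarithmic in the table size. PostgreSQL additionally offers hash indexes, which correspond to the constant-time case mentioned above.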
On these databases, we benchmarked the speed of Cypher and SQL queries for solving three bioinformatics graph processing problems: finding immediate neighbors and their interactions, finding the best-scoring path between two proteins, and finding the shortest path between them. We selected these three tasks because they illustrate the strengths and weaknesses of graph databases compared with traditional relational databases. A common task in STRING is to retrieve a neighbor network, which involves finding the immediate neighbors of a protein and all interactions between them. Expressing this as a single SQL query requires query nesting and a UNION set operation. Because Cypher currently supports neither of these features, two queries are needed to solve the task: one to find immediate neighbors and a second to find their interactions, which must be run for each ...
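The neighbor-network task can be illustrated with a small sketch. This is not the query used in the study: it is one possible formulation, run through Python's sqlite3 module as a stand-in for PostgreSQL, on a hypothetical interactions table with made-up proteins A to E. The single-query version combines nested subqueries with UNION, while the two-step version mimics the approach of first fetching neighbors and then issuing one query per network member. No Cypher is shown, because the exact syntax of the version discussed is not given in the text.

```python
import sqlite3

# Toy interactions (hypothetical names and scores, not STRING data).
EDGES = [
    ("A", "B", 0.9), ("A", "C", 0.7), ("B", "C", 0.4),
    ("C", "D", 0.8), ("D", "E", 0.6),
]


def build_db(edges):
    """Store each undirected interaction as two directed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE interactions (protein1 TEXT, protein2 TEXT, score REAL)"
    )
    both = [(a, b, s) for a, b, s in edges] + [(b, a, s) for a, b, s in edges]
    conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)", both)
    conn.execute("CREATE INDEX idx_protein1 ON interactions (protein1)")
    return conn


# Nested subqueries select the neighbors; UNION adds the seed protein itself.
# The protein1 < protein2 filter reports each undirected edge once.
SINGLE_QUERY = """
SELECT protein1, protein2, score
FROM interactions
WHERE protein1 IN (SELECT protein2 FROM interactions WHERE protein1 = :p
                   UNION SELECT :p)
  AND protein2 IN (SELECT protein2 FROM interactions WHERE protein1 = :p
                   UNION SELECT :p)
  AND protein1 < protein2
ORDER BY protein1, protein2
"""


def neighbor_network_single(conn, protein):
    """Neighbor network in one SQL statement."""
    return conn.execute(SINGLE_QUERY, {"p": protein}).fetchall()


def neighbor_network_two_step(conn, protein):
    """Neighbor network via one neighbor query plus one query per member."""
    first = conn.execute(
        "SELECT protein2 FROM interactions WHERE protein1 = ?", (protein,)
    )
    members = {row[0] for row in first} | {protein}
    edges = set()
    for member in members:
        rows = conn.execute(
            "SELECT protein2, score FROM interactions WHERE protein1 = ?",
            (member,),
        )
        for other, score in rows:
            # Keep only edges inside the network, each reported once.
            if other in members and member < other:
                edges.add((member, other, score))
    return sorted(edges)


conn = build_db(EDGES)
```

Calling neighbor_network_single(conn, "A") and neighbor_network_two_step(conn, "A") should return the same edge list. The difference lies in the number of round trips to the database, which grows with the number of neighbors in the two-step version.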
Related resources
A graph layout algorithm for drawing metabolic pathways
MOTIVATION: A large amount of data on metabolic pathways is available in databases. The ability to visualise the complex data dynamically would be useful for building more powerful research tools to access the databases. Metabolic pathways are typically modelled as graphs in which nodes represent chemical compounds, and edges represent chemical reactions between compounds. Thus, the problem of v...
PAnnBuilder: an R package for assembling proteomic annotation data
PAnnBuilder is an R package to automatically assemble protein annotation information from public resources to provide uniform annotation data for large-scale proteomic studies. Sixteen public databases have been parsed and 54 annotation packages have been constructed based on R environment or SQLite database. These ready-to-use packages cover most frequently needed protein annotation for three ...
Fitchi: haplotype genealogy graphs based on the Fitch algorithm
In population genetics and phylogeography, haplotype genealogy graphs are important tools for the visualization of population structure based on sequence data. In this type of graph, node sizes are often drawn in proportion to haplotype frequencies and edge lengths represent the minimum number of mutations separating adjacent nodes. I here present Fitchi, a new program that produce...
The global trace graph, a novel paradigm for searching protein sequence databases
MOTIVATION: Propagating functional annotations to sequence-similar, presumably homologous proteins lies at the heart of the bioinformatics industry. Correct propagation is crucially dependent on the accurate identification of subtle sequence motifs that are conserved in evolution. The evolutionary signal can be difficult to detect because functional sites may consist of non-contiguous residues w...
A redundancy detection approach to mining bioinformatics data
This paper is concerned with the search for sequences of bases via the key equivalence problem. The approach is related to the hardening of soft databases method due to Cohen et al. [4]. Here, the problem is described in graph theoretic terms. An appropriate optimization model is drawn and solved indirectly. This approach is shown to be effective. Computational results on test databases are in...
Journal title:
Volume 29, issue
Pages -
Publication date: 2013